Skeleton ToC so heading numbers match full report
Code and results for this section. Full report at:
NB This is a bare bones .Rmd file. Please see the report for full details.
Load the data - sourced from https://reshare.ukdataservice.ac.uk/853334/
We use the GREEN Grid sample of ~ 40 New Zealand dwellings/households to give worked examples of the scaling methods described.
# half-hourly total household consumption (pre-prepared)
totalF <- paste0(rmdParams$ggPath, "halfHourImputedTotalDemand_Fixed.csv.gz")
# half-hourly lighting consumption
lightingF <- paste0(rmdParams$ggPath, "halfHourLighting.csv.gz")
# half-hourly heat pump consumption
heatPumpF <- paste0(rmdParams$ggPath, "halfHourHeatPump.csv.gz")
# household attribute data
hhF <- paste0(rmdParams$ggPath, "ggHouseholdAttributesSafe_2019-04-09.csv.gz")
# weights created using IPF in SPATIALEC
hhWf <- paste0(rmdParams$ggPath, "ipf/nonZeroWeightsAu2013.csv.gz")
Now prep and check the data:
# > power ----
totalDT <- data.table::fread(totalF)
setkey(totalDT, linkID)
message("N households in total")
## N households in total
uniqueN(totalDT$linkID)
## [1] 40
heatPumpDT <- data.table::fread(heatPumpF)
message("N households with monitored heat pump circuits")
## N households with monitored heat pump circuits
uniqueN(heatPumpDT$linkID)
## [1] 29
setkey(heatPumpDT, linkID)
lightingDT <- data.table::fread(lightingF)
message("N households with monitored lighting circuits")
## N households with monitored lighting circuits
uniqueN(lightingDT$linkID)
## [1] 23
setkey(lightingDT, linkID)
powerDataPrep <- function(dt){
dt[, meanConsumptionkWh := (meanPowerW/2)/1000]
dt[, r_as_dateTime := lubridate::as_datetime(r_dateTimeHalfHour)]
dt[, r_dateTimeNZ := lubridate::with_tz(r_as_dateTime, "Pacific/Auckland")]
dt[, r_date := lubridate::as_date(r_dateTimeNZ)]
dt[, r_halfHour := hms::as_hms(r_dateTimeNZ)]
dt[, r_year := lubridate::year(r_dateTimeNZ)]
dt <- addNZSeason(dt, dateTime = "r_dateTimeNZ")
return(dt)
}
totalDT <- powerDataPrep(totalDT)
heatPumpDT <- powerDataPrep(heatPumpDT)
lightingDT <- powerDataPrep(lightingDT)
Now check the data and remove any recommended for exclusion on quality grounds (rf_46)
# check
#table(totalDT$circuit)
t <- totalDT[, .(mean_kWh = mean(meanConsumptionkWh),
nHouseholds = uniqueN(linkID),
startDate = min(r_date),
endDate = max(r_date)), keyby = .(r_year)]
ft <- set_caption(flextable(t),
caption = "Data availability")
flextable::autofit(ft)
r_year | mean_kWh | nHouseholds | startDate | endDate |
2014 | 0.4482226 | 21 | 2014-01-06 | 2014-12-31 |
2015 | 0.4599591 | 39 | 2015-01-01 | 2015-12-31 |
2016 | 0.4295491 | 30 | 2016-01-01 | 2016-12-31 |
2017 | 0.4519507 | 22 | 2017-01-01 | 2017-12-31 |
2018 | 0.4420814 | 18 | 2018-01-01 | 2018-08-01 |
totalDT <- totalDT[r_year == 2015 & linkID != "rf_46"]
heatPumpDT <- heatPumpDT[r_year == 2015 & linkID != "rf_46"]
lightingDT <- lightingDT[r_year == 2015 & linkID != "rf_46"]
We have data for multiple years. For simplicity we will keep only the data for 2015 when we have the largest number of households (see Table 3.1).
Check that the time of day demand profiles look OK. Figure 3.1 shows overall kWh profiles for all dwellings with electricity use monitoring.
# check - beware which hemisphere we are in?
table(totalDT$month, totalDT$season)
##
## Spring Summer Autumn Winter
## Jan 0 29054 0 0
## Feb 0 27908 0 0
## Mar 0 0 35090 0
## Apr 0 0 51364 0
## May 0 0 52157 0
## Jun 0 0 0 47024
## Jul 0 0 0 46487
## Aug 0 0 0 45521
## Sep 43036 0 0 0
## Oct 43633 0 0 0
## Nov 39667 0 0 0
## Dec 0 38840 0 0
testDT <- totalDT[, .(meankWh = mean(meanConsumptionkWh)),
keyby = .(r_halfHour, season, r_year)]
ggplot2::ggplot(testDT, aes(x = r_halfHour, y = meankWh, colour = season)) +
geom_line() +
labs(caption = "Whole household mean kWh per half hour") +
facet_grid(. ~ r_year)
Figure 3.1: Overall kWh profile by season
Figure 3.2 shows lighting kWh profiles for all dwellings with electricity use monitoring.
testDT <- lightingDT[, .(meankWh = mean(meanConsumptionkWh)),
keyby = .(r_halfHour, season, r_year)]
ggplot2::ggplot(testDT, aes(x = r_halfHour, y = meankWh, colour = season)) +
geom_line() +
labs(caption = "Lighting mean kWh per half hour") +
facet_grid(. ~ r_year)
Figure 3.2: Lighting profile by season
Both Figure @ref(fig:checkProfile_all) and Figure 3.2 show the time of day profiles we would expect.
Process and check household data
# > household ----
hhDT <- data.table::fread(hhF)
hhDT <- code_Q7(hhDT)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hhDT[, hasPV := `PV Inverter`]
hhDT[, hasPV := ifelse(hasPV == "", "No", hasPV)]
hhDT[, hasSurvey := "Yes"]
hhDT[, nPeople := nAdults + nChildren0_12 + nTeenagers13_18]
hhDT[, nPeople_Factor := ifelse(nPeople > 1, "2", "1")]
hhDT[, nPeople_Factor := ifelse(nPeople > 2, "3", nPeople_Factor)]
hhDT[, nPeople_Factor := ifelse(nPeople > 3, "4+", nPeople_Factor)]
# check
message("Number of households in survey data")
## Number of households in survey data
uniqueN(hhDT$linkID)
## [1] 44
setkey(hhDT, linkID)
setkey(totalDT, linkID)
dt <- totalDT[, .(nObs = .N), keyby = .(linkID)]
dt[, hasElecMonitor := "Yes"]
setkey(dt, linkID)
hhDT <- merge(hhDT, dt, all = TRUE) # keep all the linkIDs
t <- hhDT[, .(n = .N), keyby = .(hasSurvey, hasElecMonitor)]
set_caption(flextable(t),caption = "Number of households with/without survey and monitoring data")
hasSurvey | hasElecMonitor | n |
Yes | 2 | |
Yes | 8 | |
Yes | Yes | 36 |
Note that not all households in the survey data have electricity monitoring data and vice versa.
Now check for 0 and negative values in the dwelling overall total kWh.
# any zeros & negative numbers?
#hist(totalDT$meanConsumptionkWh)
ggplot2::ggplot(totalDT, aes(x = meanConsumptionkWh)) +
geom_histogram() +
labs(caption = "Mean half-hourly data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# some
# check if aggregated to daily sum
dailyAllDT <- totalDT[, .(sumkWh = sum(meanConsumptionkWh), # daily sum
meankWh = mean(meanConsumptionkWh) # daily mean
),
keyby = .(r_date, linkID, season)]
ggplot2::ggplot(dailyAllDT, aes(x = sumkWh)) +
geom_histogram() +
labs(caption = "Daily sum of half-hourly data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# still some neg - probably due to PV?
# check if aggregated to household
dt <- dailyAllDT[, .(meankWh = mean(sumkWh),
sdkWh = sd(sumkWh)), keyby = .(linkID)]
ggplot2::ggplot(dt, aes(x = meankWh)) +
geom_histogram() +
labs(caption = "Mean of all half-hourly data per household")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# still some
There are negative kWh values at all levels of aggregation.
Check if the negative values are to do with PV.
# need to check -ve = mid-day, if not is not PV must just be errors?
totalDT[, testValues := "> 0"]
totalDT[, testValues := ifelse(meanConsumptionkWh == 0, "0", testValues)]
totalDT[, testValues := ifelse(meanConsumptionkWh < 0, "< 0", testValues)]
testDT <- totalDT[hhDT[, .(linkID, hasPV)]]
dt <- testDT[testValues == "< 0", .(nValues = .N),
keyby = .(r_halfHour, season, linkID, testValues,hasPV)]
p <- ggplot2::ggplot(dt[!is.na(testValues) & !is.na(season)], aes(x = r_halfHour, y = nValues, colour = linkID)) +
geom_line() +
facet_grid(season ~ testValues + hasPV) +
labs(x = "Half hour",
y = "N",
caption = paste0("All -ve values, Yes = PV reported in survey\n",
"Colours = individual dwellings, legend omitted for clarity")
) +
theme(legend.position="none")
p
Figure 3.3: Count of negative kWh values by time of day and whether reported presence of PV
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:flextable':
##
## style
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plotly::ggplotly(p)
Figure 3.3: Count of negative kWh values by time of day and whether reported presence of PV
so: a) neg values indicate PV - although at least one household who reported having PV (“Yes”) apparently did not (rf_23) b) low levels of negative values where PV is not reported (’No") may indicate monitoring device errors (rf_14 and rf_26)
If we only care about network load we could keep all values > 0 only. If we want total energy input then we need grid draw + PV input which we might be able to calculate from the PV circuit W - grid export but it gets complicated.
So for now we’ll leave out the dwellings which seem to have PV
setkey(dailyAllDT, linkID)
dailyDT <- dailyAllDT[hhDT[hasPV == "No"]]
ggIPF_AU_DT <- data.table::fread(hhWf)
skimr::skim(ggIPF_AU_DT)
| Name | ggIPF_AU_DT |
| Number of rows | 78078 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| AU2013_label | 0 | 1 | 4 | 33 | 0 | 1859 | 0 |
| REGC2013_label | 0 | 1 | 12 | 24 | 0 | 17 | 0 |
| linkID | 0 | 1 | 5 | 6 | 0 | 42 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| AU2013_code | 0 | 1 | 554468.19 | 33426.10 | 500100 | 524711.0 | 551030.00 | 584936.00 | 622201.00 | ▇▇▅▇▅ |
| nMBs | 0 | 1 | 24.75 | 16.54 | 1 | 11.0 | 22.00 | 35.00 | 124.00 | ▇▅▁▁▁ |
| ipfWeight | 0 | 1 | 19.84 | 37.87 | 0 | 3.3 | 9.88 | 20.64 | 1973.35 | ▇▁▁▁▁ |
mbieDF <- readxl::read_xlsx(paste0(rmdParams$mbiePath, "/MBIE_Electricity_extract_2015.xlsx"))
## New names:
## * `` -> ...1
mbieDT <- as.data.table(mbieDF)
setnames(mbieDT, "...1", "Variable")
set_caption(colformat_num(flextable(mbieDT), digits = 2),
caption = "MBIE data extract for 2015")
Variable | Q1 2015 | Q2 2015 | Q3 2015 | Q4 2015 |
Net Generation (GWh)1,2 | 10,160.94 | 10,760.95 | 11,519.74 | 10,598.11 |
Hydro | 5,305.53 | 6,102.22 | 6,809.89 | 6,067.01 |
Geothermal | 1,851.91 | 1,927.84 | 1,882.43 | 1,892.51 |
Biogas | 57.36 | 61.69 | 62.45 | 62.48 |
Wood | 86.70 | 88.60 | 81.73 | 91.71 |
Wind | 459.17 | 619.16 | 587.16 | 675.01 |
Solar3 | 8.81 | 6.11 | 7.67 | 13.55 |
Oil | 0.31 | 0.73 | 0.37 | 0.04 |
Coal | 668.59 | 320.36 | 301.76 | 462.30 |
Gas | 1,710.77 | 1,621.84 | 1,773.87 | 1,321.09 |
Waste Heat4 | 11.80 | 12.40 | 12.40 | 12.40 |
Total Consumption(GWh)5 | 9,513.73 | 10,035.25 | 11,024.87 | 10,025.26 |
Residential consumption (GWh) | 2,427.96 | 3,141.21 | 4,103.25 | 2,900.94 |
Residential Number of ICPs5 | 1,702,555.93 | 1,717,234.06 | 1,725,995.65 | 1,736,972.82 |
Residential Average consumption per ICP (kWh)5 | 1,426.07 | 1,829.23 | 2,377.33 | 1,670.11 |
Mean daily kWh per ICP | 15.85 | 20.32 | 26.41 | 18.56 |
resCons <- mbieDT[Variable %like% "Residential"]
dt <- melt(resCons)
## Warning in melt.data.table(resCons): id.vars and measure.vars are internally
## guessed when both are 'NULL'. All non-numeric/integer/logical type columns are
## considered id.vars, which in this case are columns [Variable, ...]. Consider
## providing at least one of 'id' or 'measure' vars in future.
annualResGWh <- sum(dt[Variable == "Residential consumption (GWh)"]$value)
rmdParams$NZ_nResICPs_MBIE_2015 <- mean(dt[Variable == "Residential Number of ICPs5"]$value)
dailyResMeanTotalGWh <- annualResGWh/365
dailyResMeanKWh <- 1000000*(dailyResMeanTotalGWh/rmdParams$NZ_nResICPs_MBIE_2015)
# NB: Mean daily kWh per ICP = average q total/90 (days)
We use this data below to validate the upscaled results. The quarterly residential consumption figures are used directly and we also calculate:
This data comprises family, household or dwelling counts (depending on the variables of interest) at area unit level. The data was downloaded from the Stats NZ census table download tool and processed using code developed as part of the EU H2020 Marie Skłodowska-Curie scheme-funded SPATIALEC Global Fellowship hosted by the University of Otago.
f <- paste0(rmdParams$censusPath, "nResidents/TABLECODE8169_Data_f4c79560-8c41-4ff9-b861-5598e6e0b50d.csv")
censusDT <- data.table::fread(f)
censusOccupancyDT <- censusDT[Area == "Total, New Zealand by regional council/area unit" &
`Number of usual residents in household` %like% "usual"]
ft <- set_caption(flextable(censusOccupancyDT[Year == "2013", .(`Number of usual residents in household`, n = Value)]), caption = "Number of households by occupancy (2013)")
autofit(ft)
Number of usual residents in household | n |
One usual resident | 355284 |
Two usual residents | 527676 |
Three usual residents | 254862 |
Four usual residents | 235332 |
Five usual residents | 106329 |
Six usual residents | 41184 |
Seven usual residents | 15402 |
Eight or more usual residents | 13824 |
censusOccupancyDT[, nPeople := `Number of usual residents in household`]
censusOccupancyDT[, nPeople_Factor := ifelse(nPeople %like% "One", "1", "4+")]
censusOccupancyDT[, nPeople_Factor := ifelse(nPeople %like% "Two", "2", nPeople_Factor)]
censusOccupancyDT[, nPeople_Factor := ifelse(nPeople %like% "Three", "3", nPeople_Factor)]
#censusOccupancyDT[, .(n = sum(Value)), keyby = .(nPeople_Factor,nPeople)]
rmdParams$NZ_nHHs_Census_2013 <- sum(censusOccupancyDT[Year == "2013"]$Value)
Table 3.5 shows the number of households by household size. We will use this later.
# might not need this
f <- paste0(rmdParams$censusPath, "2013/2013IpfInput.csv")
censusIPFInputAUDT <- data.table::fread(f)
skimr::skim(censusIPFInputAUDT)
| Name | censusIPFInputAUDT |
| Number of rows | 2020 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| AU2013_label | 0 | 1 | 4 | 34 | 0 | 2020 | 0 |
| REGC2013_label | 0 | 1 | 12 | 24 | 0 | 17 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| AU2013_code | 0 | 1.00 | 558310.19 | 36014.20 | 500100 | 526176.50 | 555350.0 | 587302.25 | 666673.00 | ▇▆▇▃▁ |
| nRooms1_4 | 8 | 1.00 | 121.33 | 182.91 | 0 | 24.00 | 72.0 | 165.00 | 4122.00 | ▇▁▁▁▁ |
| nRooms5_6 | 8 | 1.00 | 312.30 | 263.44 | 0 | 84.00 | 261.0 | 489.00 | 1758.00 | ▇▅▁▁▁ |
| nRooms7m | 8 | 1.00 | 342.56 | 285.93 | 0 | 108.00 | 297.0 | 504.75 | 1992.00 | ▇▃▁▁▁ |
| nrooms_statedtotalHouseholds | 8 | 1.00 | 731.16 | 589.34 | 0 | 219.00 | 643.5 | 1131.00 | 4968.00 | ▇▃▁▁▁ |
| nrooms_totalHouseholds | 8 | 1.00 | 776.29 | 623.96 | 0 | 233.25 | 685.5 | 1197.00 | 5568.00 | ▇▃▁▁▁ |
| calcSum | 8 | 1.00 | 776.19 | 624.19 | 0 | 231.00 | 685.5 | 1194.75 | 5562.00 | ▇▃▁▁▁ |
| calcSumPc | 131 | 0.94 | 98.50 | 12.46 | 0 | 99.69 | 100.0 | 100.28 | 137.50 | ▁▁▁▇▁ |
| nPeople_1 | 8 | 1.00 | 176.58 | 168.71 | 0 | 48.00 | 132.0 | 261.00 | 1797.00 | ▇▁▁▁▁ |
| nPeople_2 | 8 | 1.00 | 262.21 | 217.39 | 0 | 84.00 | 228.0 | 390.75 | 2199.00 | ▇▂▁▁▁ |
| nPeople_3 | 8 | 1.00 | 126.64 | 106.58 | 0 | 33.00 | 108.0 | 198.00 | 948.00 | ▇▂▁▁▁ |
| nPeople_4m | 8 | 1.00 | 204.70 | 182.03 | 0 | 54.00 | 168.0 | 306.00 | 1167.00 | ▇▃▁▁▁ |
| npeople_totalHouseholds | 8 | 1.00 | 770.32 | 618.78 | 0 | 231.00 | 681.0 | 1188.75 | 5367.00 | ▇▃▁▁▁ |
| i.calcSum | 8 | 1.00 | 770.13 | 618.81 | 0 | 231.00 | 684.0 | 1191.00 | 5364.00 | ▇▃▁▁▁ |
| i.calcSumPc | 124 | 0.94 | 98.39 | 12.52 | 0 | 99.69 | 100.0 | 100.24 | 125.00 | ▁▁▁▇▃ |
| nKids_1m | 8 | 1.00 | 261.57 | 220.29 | 0 | 69.00 | 222.0 | 402.00 | 1335.00 | ▇▅▂▁▁ |
| nkids_totalFamilies | 8 | 1.00 | 564.81 | 452.99 | 0 | 168.00 | 502.5 | 861.00 | 2460.00 | ▇▅▂▁▁ |
| nKids_0_families | 8 | 1.00 | 303.24 | 247.00 | 0 | 96.00 | 262.5 | 453.00 | 1728.00 | ▇▅▁▁▁ |
| i.calcSum.2 | 8 | 1.00 | 564.81 | 452.99 | 0 | 168.00 | 502.5 | 861.00 | 2460.00 | ▇▅▂▁▁ |
| i.calcSumPc.2 | 134 | 0.93 | 100.00 | 0.00 | 100 | 100.00 | 100.0 | 100.00 | 100.00 | ▁▁▇▁▁ |
| fuel_totalHouseholds | 8 | 1.00 | 776.29 | 623.96 | 0 | 233.25 | 685.5 | 1197.00 | 5568.00 | ▇▃▁▁▁ |
| fuel_totalStatedHouseholds | 8 | 1.00 | 733.37 | 590.50 | 0 | 219.00 | 646.5 | 1131.00 | 4920.00 | ▇▃▁▁▁ |
| heatSourceCoal | 8 | 1.00 | 30.30 | 62.10 | 0 | 6.00 | 15.0 | 30.00 | 930.00 | ▇▁▁▁▁ |
| heatSourceElectricity | 8 | 1.00 | 592.64 | 498.79 | 0 | 156.00 | 492.0 | 918.00 | 3342.00 | ▇▅▁▁▁ |
| heatSourceGas | 8 | 1.00 | 200.97 | 205.12 | 0 | 45.00 | 138.0 | 300.00 | 1371.00 | ▇▂▁▁▁ |
| heatSourceOther | 8 | 1.00 | 76.61 | 98.01 | 0 | 18.00 | 51.0 | 105.75 | 2202.00 | ▇▁▁▁▁ |
| heatSourceWood | 8 | 1.00 | 269.51 | 245.25 | 0 | 96.00 | 213.0 | 375.00 | 1764.00 | ▇▂▁▁▁ |
| i.calcSum.1 | 8 | 1.00 | 1170.03 | 898.84 | 0 | 393.00 | 1071.0 | 1773.75 | 5922.00 | ▇▅▂▁▁ |
| i.calcSumPc.1 | 131 | 0.94 | 154.81 | 25.82 | 0 | 141.04 | 153.8 | 167.98 | 233.33 | ▁▁▂▇▁ |
| nMBs | 0 | 1.00 | 23.09 | 16.90 | 1 | 9.00 | 20.0 | 34.00 | 124.00 | ▇▅▁▁▁ |
| nKids_0 | 8 | 1.00 | 508.75 | 433.10 | 0 | 152.25 | 430.5 | 771.00 | 4929.00 | ▇▁▁▁▁ |
| nkids_totalHouseholds | 8 | 1.00 | 770.32 | 618.78 | 0 | 231.00 | 681.0 | 1188.75 | 5367.00 | ▇▃▁▁▁ |
f <- paste0(rmdParams$censusPath, "2013/censusAu2013IpfInput.csv")
censusIPFInputAU_reducedDT <- data.table::fread(f)
skimr::skim(censusIPFInputAU_reducedDT)
| Name | censusIPFInputAU_reducedD… |
| Number of rows | 2020 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 24 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| AU2013_label | 0 | 1 | 4 | 34 | 0 | 2020 | 0 |
| REGC2013_label | 0 | 1 | 12 | 24 | 0 | 17 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| AU2013_code | 0 | 1.00 | 558310.19 | 36014.20 | 500100 | 526176.50 | 555350.0 | 587302.25 | 666673 | ▇▆▇▃▁ |
| nrooms_totalHouseholds | 8 | 1.00 | 776.29 | 623.96 | 0 | 233.25 | 685.5 | 1197.00 | 5568 | ▇▃▁▁▁ |
| nRooms1_4 | 163 | 0.92 | 5.96 | 24.65 | 0 | 0.00 | 3.0 | 6.00 | 780 | ▇▁▁▁▁ |
| nRooms5_6 | 162 | 0.92 | 136.73 | 110.18 | 0 | 45.00 | 114.0 | 204.00 | 768 | ▇▃▁▁▁ |
| nRooms7m | 162 | 0.92 | 135.55 | 101.55 | 0 | 51.00 | 117.0 | 198.00 | 639 | ▇▅▂▁▁ |
| nrooms_statedtotalHouseholds | 142 | 0.93 | 783.33 | 575.53 | 6 | 297.00 | 702.0 | 1164.00 | 4968 | ▇▃▁▁▁ |
| count.NA | 162 | 0.92 | 48.85 | 46.13 | 0 | 15.00 | 36.0 | 69.00 | 600 | ▇▁▁▁▁ |
| npeople_totalHouseholds | 8 | 1.00 | 770.32 | 618.78 | 0 | 231.00 | 681.0 | 1188.75 | 5367 | ▇▃▁▁▁ |
| nPeople_1 | 155 | 0.92 | 190.49 | 167.50 | 0 | 60.00 | 150.0 | 273.00 | 1797 | ▇▁▁▁▁ |
| nPeople_2 | 157 | 0.92 | 283.19 | 212.37 | 0 | 114.00 | 249.0 | 405.00 | 2199 | ▇▂▁▁▁ |
| nPeople_3 | 161 | 0.92 | 137.06 | 104.23 | 0 | 48.00 | 120.0 | 204.00 | 948 | ▇▃▁▁▁ |
| nPeople_4m | 161 | 0.92 | 126.59 | 100.35 | 0 | 42.00 | 108.0 | 183.00 | 744 | ▇▃▁▁▁ |
| nkids_totalFamilies | 8 | 1.00 | 564.81 | 452.99 | 0 | 168.00 | 502.5 | 861.00 | 2460 | ▇▅▂▁▁ |
| nKids_1m | 197 | 0.90 | 182.94 | 137.34 | 3 | 69.00 | 162.0 | 264.00 | 987 | ▇▅▁▁▁ |
| nKids_0_families | 197 | 0.90 | 440.05 | 307.64 | 6 | 177.00 | 402.0 | 636.00 | 2010 | ▇▆▂▁▁ |
| fuel_totalHouseholds | 8 | 1.00 | 776.29 | 623.96 | 0 | 233.25 | 685.5 | 1197.00 | 5568 | ▇▃▁▁▁ |
| heatSourceElectricity | 148 | 0.93 | 624.43 | 482.04 | 0 | 213.00 | 537.0 | 930.75 | 3279 | ▇▅▁▁▁ |
| heatSourceGas | 163 | 0.92 | 95.18 | 166.51 | 0 | 3.00 | 9.0 | 114.00 | 1110 | ▇▁▁▁▁ |
| heatSourceWood | 156 | 0.92 | 290.91 | 242.28 | 0 | 120.00 | 234.0 | 393.00 | 1764 | ▇▃▁▁▁ |
| heatSourceCoal | 161 | 0.92 | 32.79 | 63.97 | 0 | 6.00 | 15.0 | 33.00 | 930 | ▇▁▁▁▁ |
| heatSourceOther | 163 | 0.92 | 12.38 | 15.10 | 0 | 3.00 | 9.0 | 15.00 | 258 | ▇▁▁▁▁ |
| fuel_totalStatedHouseholds | 142 | 0.93 | 785.70 | 576.58 | 6 | 294.75 | 705.0 | 1170.00 | 4920 | ▇▃▁▁▁ |
| nMBs | 0 | 1.00 | 23.09 | 16.90 | 1 | 9.00 | 20.0 | 34.00 | 124 | ▇▅▁▁▁ |
| nKids_0 | 197 | 0.90 | 673.04 | 488.46 | 12 | 267.00 | 603.0 | 976.50 | 5271 | ▇▂▁▁▁ |
In this example we simply multiply point estimates by the population size. In this case we:
Table 3.7 shows the mean daily kWh and the estimated 95% confidence intervals for this mean calculated from Green Grid sample. The table also shows the same result for a simulated HEEP2 full sample size of 280 to illustrate the effect of increasing sample size (only) on precision c.f (Anderson et al. 2020). In this case we assume that the mean and s.d. are unchanged. As noted in WP1, there is no ‘right’ value for this sample size - it depends what level of (im)precision can be accommodated in the analytic results as will be made clear below.
baseDT <- dailyDT[!is.na(sumkWh), .(mean = mean(sumkWh),
sd = sd(sumkWh),
nSample = uniqueN(linkID))]
baseDT[, ci_lower := mean - (sd/sqrt(nSample))*qnorm(0.975)]
baseDT[, ci_upper := mean + (sd/sqrt(nSample))*qnorm(0.975)]
baseDT[, source := "GreenGrid"]
baseDTHEEP2 <- copy(baseDT)
# simulate HEEP2 full
# really we should do a proper re-sampling etc but...
baseDTHEEP2[, ci_lower := mean - (sd/sqrt(280))*qnorm(0.975)]
baseDTHEEP2[, ci_upper := mean + (sd/sqrt(280))*qnorm(0.975)]
baseDTHEEP2[, source := "Simulated HEEP2 sample size"]
baseDTHEEP2[, nSample := 280]
dt <- rbind(baseDT,baseDTHEEP2)
dt[, ci_range := ci_upper - ci_lower]
ft <- colformat_num(flextable(dt[, .(source, nSample, mean, sd,
ci_lower, ci_upper, ci_range)]), digits = 2, na_str = "N/A")
ft <- colformat_num(ft, j = "nSample", digits = 0, na_str = "N/A")
set_caption(autofit(ft),
caption = "Mean daily household kWh (2015)")
source | nSample | mean | sd | ci_lower | ci_upper | ci_range |
GreenGrid | 33 | 22.76 | 12.66 | 18.44 | 27.08 | 8.64 |
Simulated HEEP2 sample size | 280 | 22.76 | 12.66 | 21.27 | 24.24 | 2.97 |
# for text
pcOverEstimate <- 100*((baseDT$mean/dailyResMeanKWh)-1) # % mean higher than MBIE mean
ci_range <- baseDT$ci_upper - baseDT$ci_lower # CI range
pcOfMean <- ci_range/baseDT$mean
As we can see the Green grid data produces a mean daily kWh value which is ~14% higher than the MBIE derived figure of 20.02. However we can also see that the MBIE value is included within the very large Green Grid 95% CI which range from 18.44 to 27.08 or 9 kWh/day (~ 38 % of the mean)!
On the other hand, while the HEEP2 simulation shows the consequences of increasing sample size on precision (much narrower 95% confidence interval), the MBIE value now lies outside (below) the 95% lower confidence interval limit. We would therefore suspect that the Green grid sample may have higher than average consumption.
Table 3.8 shows the result of scaling this daily mean by the estimated number of New Zealand households (1,721,000). The first row uses the original Green Grid mean and confidence intervals while the second row uses the simulated HEEP2 confidence intervals.
dt[, GWh := (mean * rmdParams$NZ_HHs)/1000000]
dt[, GWh_lower := (ci_lower * rmdParams$NZ_HHs)/1000000]
dt[, GWh_upper := (ci_upper * rmdParams$NZ_HHs)/1000000]
dt[, nNZHHs := rmdParams$NZ_HHs]
dt <- dt[, GWh_CI := GWh_upper - GWh_lower]
ft <- colformat_num(flextable(dt[, .(source, nSample,nNZHHs,
GWh, GWh_lower, GWh_upper,
GWh_CI)]), digits = 2, na_str = "N/A")
ft <- colformat_num(autofit(ft), j = c("nSample", "nNZHHs"),
digits = 0, na_str = "N/A")
set_caption(ft,
caption = "Mean total daily household GWh (scaled, 2015)")
source | nSample | nNZHHs | GWh | GWh_lower | GWh_upper | GWh_CI |
GreenGrid | 33 | 1,721,000 | 39.16 | 31.73 | 46.60 | 14.87 |
Simulated HEEP2 sample size | 280 | 1,721,000 | 39.16 | 36.61 | 41.72 | 5.10 |
As we would expect (second row in Table 3.8), increasing the sample size to 280 increases precision. Nevertheless, the 95% CI of ~ 5 GWh/day is roughly equivalent to the New Zealand daily wind generation of 6.5 GWh in Q3 2015 (c.f. Table 3.4), a level of imprecision that may still preclude use of the estimate for practical purposes. As noted above and as discussed in WP1 report at length, the level of precision required should dictate the sample size required.
To extend the example, we next estimate the same values but by quarter to compare with the MBIE data. Table 3.9 shows the mean daily kWh per household by season. NA values indicate the households for whom no electricity monitoring data exists. The table also shows the lower and upper 95% confidence intervals calculated using the sample sd and n.
For validation purposes we use the quarterly per ICP kWh consumption values for 2015 (see Table 3.4):
These have been converted to mean daily values for comparison purposes as shown in Table 3.9 and Figure 3.4.
Figure 3.4 shows the Green grid estimates (columns) together with their 95% confidence intervals (error bars) and the quarterly MBIE values (dots). We can see that the MBIE derived values are a close match to the Green Grid point estimates for Q1 and Q4 and lie in the centre of the Green Grid 95% confidence intervals. However the Green Grid data appears to over-estimate consumption in Q2 and Q3 which correspond to the New Zealand heating season. This is particularly noticeable for Q2 where the MBIE value is only just within the Green Grid 95% confidence interval. Once again, this reflects the Green grid sample bias towards family households with higher than average electricity demand in winter.
dailyDT[, q := ifelse(r_date >= lubridate::as_date("2015-04-01"), "Q2", "Q1")]
dailyDT[, q := ifelse(r_date >= lubridate::as_date("2015-07-01"), "Q3", q)]
dailyDT[, q := ifelse(r_date >= lubridate::as_date("2015-10-01"), "Q4", q)]
dailyDT[, r_month := lubridate::month(r_date)]
#table(dailyDT$q, dailyDT$r_month)
qDT <- dailyDT[, .(mean = mean(sumkWh),
sd = sd(sumkWh),
n = uniqueN(linkID)), keyby = q]
qDT[, ci_lower := mean - (sd/sqrt(n))*qnorm(0.975)]
qDT[, ci_upper := mean + (sd/sqrt(n))*qnorm(0.975)]
# a quarter is ~ 90 days but...
qDT[q == "Q1", mbie_kWh := 1426/as.double(difftime("2015-04-01","2015-01-01"))]
qDT[q == "Q2", mbie_kWh := 1829/as.double(difftime("2015-07-01","2015-04-01"))]
qDT[q == "Q3", mbie_kWh := 2377/as.double(difftime("2015-10-01","2015-07-01"))]
qDT[q == "Q4", mbie_kWh := 1670/as.double(difftime("2015-12-31","2015-10-01"))]
ft <- colformat_num(flextable(qDT), digits = 2, na_str = "N/A")
ft <- colformat_num(ft, j = "n", digits = 0, na_str = "N/A")
set_caption(ft,
caption = "Mean daily household kWh by quarter (2015)")
q | mean | sd | n | ci_lower | ci_upper | mbie_kWh |
N/A | N/A | N/A | 7 | N/A | N/A | N/A |
Q1 | 16.56 | 7.85 | 32 | 13.84 | 19.28 | 15.85 |
Q2 | 24.40 | 12.87 | 32 | 19.94 | 28.86 | 20.10 |
Q3 | 28.36 | 14.44 | 29 | 23.11 | 33.62 | 25.84 |
Q4 | 18.93 | 9.50 | 28 | 15.41 | 22.45 | 18.34 |
ggplot2::ggplot(qDT, aes(x = q, y = mean, fill = q)) +
geom_col() +
scale_fill_discrete(name = "Quarter") +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), colour = "grey50") +
geom_point(aes(y = mbie_kWh))
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_point).
Figure 3.4: Quarterly mean daily kWh
Table 3.10 shows the results of multiplying each of the point estimates (the mean) and the upper and lower confidence intervals by the population size (1,796,000) to estimate the total daily electricity consumption for all New Zealand households by quarter. As would be expected given Table 3.9, the uncertainty in the estimates is very large with the 95% confidence interval of ~19 GWh in Q3 roughly equivalent to the daily output of all New Zealand geothermal generation in Q3 2015 (~ 20 GWh/day - see MBIE data Table 1) or about 16% of total net generation. This level of uncertainty would mean that generalisation for policy purposes would be far from robust.
qDT[, scaledSum := mean * rmdParams$NZ_HHs]
qDT[, scaledCI_lo := ci_lower * rmdParams$NZ_HHs]
qDT[, scaleduCI_hi := ci_upper * rmdParams$NZ_HHs]
t <- qDT[, .(q, n, GWh = scaledSum/1000000,
GWh_lower = scaledCI_lo/1000000,
GWh_upper = scaleduCI_hi/1000000)]
t <- t[, CI := GWh_upper - GWh_lower]
ft <- flextable(t[!is.na(q), .(q, sampleN = n, popN = rmdParams$NZ_HHs, GWh, GWh_lower, GWh_upper, CI)])
ft <- colformat_num(ft, digits = 2)
ft <- colformat_num(ft, j = c("sampleN", "popN"), digits = 0, na_str = "N/A")
set_caption(ft,
caption = "Mean total daily household GWh by quarter (scaled, 2015)")
q | sampleN | popN | GWh | GWh_lower | GWh_upper | CI |
Q1 | 32 | 1,721,000 | 28.50 | 23.82 | 33.18 | 9.36 |
Q2 | 32 | 1,721,000 | 42.00 | 34.32 | 49.67 | 15.35 |
Q3 | 29 | 1,721,000 | 48.81 | 39.76 | 57.86 | 18.09 |
Q4 | 28 | 1,721,000 | 32.58 | 26.52 | 38.63 | 12.12 |
As before, we have re-calculated the 95% confidence intervals using n = 280 to simulate the potential precision for the HEEP2 full sample. Table 3.11 shows the results of doing so - the uncertainty in Q4 is now reduced to a 95% confidence interval of ~ 6 GWh, roughly equivalent to the daily wind generation of 7.5 GWh in Q4 2015. While the precision has improved, there is still substantial uncertainty illustrating the difficulty of making national level estimates from relatively small samples.
qDT[, ci_lower_HEEP2 := mean - (sd/sqrt(280))*qnorm(0.975)]
qDT[, ci_upper_HEEP2 := mean + (sd/sqrt(280))*qnorm(0.975)]
qDT[, scaledSum := mean * rmdParams$NZ_HHs]
qDT[, scaledlCI_lo := ci_lower_HEEP2 * rmdParams$NZ_HHs]
qDT[, scaleduCI_hi := ci_upper_HEEP2 * rmdParams$NZ_HHs]
t <- qDT[, .(q, GWh = scaledSum/1000000,
GWh_lower = scaledlCI_lo/1000000,
GWh_upper = scaleduCI_hi/1000000)]
t <- t[, CI := GWh_upper - GWh_lower]
ft <- flextable(t[!is.na(q), .(q, sampleN = 280, popN = rmdParams$NZ_HHs,
GWh, GWh_lower, GWh_upper, CI)])
ft <- colformat_num(ft, digits = 2)
ft <- colformat_num(ft, j = c("sampleN","popN"), digits = 0, na_str = "N/A")
set_caption(ft,
caption = "Simulated mean total daily household GWh by quarter (n = 280, scaled, 2015)")
q | sampleN | popN | GWh | GWh_lower | GWh_upper | CI |
Q1 | 280 | 1,721,000 | 28.50 | 26.92 | 30.09 | 3.16 |
Q2 | 280 | 1,721,000 | 42.00 | 39.40 | 44.59 | 5.19 |
Q3 | 280 | 1,721,000 | 48.81 | 45.90 | 51.72 | 5.82 |
Q4 | 280 | 1,721,000 | 32.58 | 30.66 | 34.49 | 3.83 |
In this example we adjust the daily kWh to account for the distribution of the number of dwelling occupants (1,2,3,4+). To do this we need to know the proportion of these households in both the Green Grid sample and the closest (2013) NZ census. Note that the total number of households in 2013 was 1,549,875 rather than the 2015 estimate of 1,796,000 as used above.
Table 3.12 shows the distribution of households of different sizes in the Green grid 2015 sample compared to the 2013 Census data. This table clearly shows that the Green grid sample under-represents single person households and over-represents 3-4+ person households. Given that larger households tend to use more electricity (see mean kWh column) adjusting for these over-representations may reduce the bias in the sample mean and thus in a scaled population estimate.
# GG data
occDT <- dailyAllDT[hhDT][!is.na(nPeople_Factor) &
!is.na(season), .(nHHs_GG = uniqueN(linkID),
meanDailykWh = mean(sumkWh),
sd = sd(sumkWh)), keyby = .(nPeople = nPeople_Factor)]
occDT[, pc_HHS_GG := 100*(nHHs_GG/sum(nHHs_GG))]
# census data
c_occDT <- censusOccupancyDT[Year == 2013, .(nHHs_census = sum(Value)), keyby = .(nPeople = nPeople_Factor)]
c_occDT[, pc_HHS_census := 100*(nHHs_census/sum(nHHs_census))]
setkey(occDT, nPeople)
setkey(c_occDT, nPeople)
occDT <- occDT[c_occDT]
ft <- set_caption(flextable(occDT[,.(nPeople, nHHs_GG, pc_HHS_GG, meanDailykWh, nHHs_census, pc_HHS_census)]),
caption = "Mean daily kWh by household size and distribution of household sizes (Green Grid 2015, Census 2013)")
ft <- colformat_num(ft, digits = 2)
colformat_num(ft, j = c("nHHs_GG", "nHHs_census"), digits = 0, na_str = "N/A")
nPeople | nHHs_GG | pc_HHS_GG | meanDailykWh | nHHs_census | pc_HHS_census |
1 | 3 | 8.57 | 8.45 | 355,284 | 22.92 |
2 | 10 | 28.57 | 22.28 | 527,676 | 34.05 |
3 | 7 | 20.00 | 23.53 | 254,862 | 16.44 |
4+ | 15 | 42.86 | 23.30 | 412,071 | 26.59 |
Table 3.13 shows the mean daily kWh by household size together with their 95% confidence intervals confidence intervals. We can see that these are extremely wide due to the small counts within each occupancy category. Indeed the 95% CI for the single person households spans 0… As above, we then simulate the 95% CI using the HEEP2 full sample size of 280 to show how the uncertainty would reduce with a larger sample size.
occDT[, ci_lower := meanDailykWh - (sd/sqrt(nHHs_GG))*qnorm(0.975)]
occDT[, ci_upper := meanDailykWh + (sd/sqrt(nHHs_GG))*qnorm(0.975)]
# distribute the HEEP2 280 according to census %
occDT[, nHEEP2 := 280 * pc_HHS_census/100]
occDT[, ci_lowerHEEP2 := meanDailykWh - (sd/sqrt(nHEEP2))*qnorm(0.975)]
occDT[, ci_upperHEEP2 := meanDailykWh + (sd/sqrt(nHEEP2))*qnorm(0.975)]
ft <- set_caption(flextable(occDT[, .(nPeople, nHHs_GG, meanDailykWh,
sd, ci_lower, ci_upper,
ci_lowerHEEP2, ci_upperHEEP2)]),
caption = "Mean daily kWh by household size (Green Grid, 2015)")
ft <- colformat_num(ft, digits = 2)
colformat_num(ft, j = c("nHHs_GG"), digits = 0, na_str = "N/A")
nPeople | nHHs_GG | meanDailykWh | sd | ci_lower | ci_upper | ci_lowerHEEP2 | ci_upperHEEP2 |
1 | 3 | 8.45 | 23.18 | -17.79 | 34.68 | 2.77 | 14.12 |
2 | 10 | 22.28 | 10.95 | 15.50 | 29.07 | 20.09 | 24.48 |
3 | 7 | 23.53 | 15.64 | 11.94 | 35.11 | 19.01 | 28.04 |
4+ | 15 | 23.30 | 12.79 | 16.83 | 29.78 | 20.40 | 26.21 |
Repeating the multiplicative scaling process now requires us to multiple the mean daily values by the number of households in each occupancy category.
occDT[, dailyGWh := (nHHs_census * meanDailykWh)/1000000]
occDT[, scaledggCI_lo := (ci_lower * nHHs_census)/1000000]
occDT[, scaledggCI_hi := (ci_upper * nHHs_census)/1000000]
occDT[, scaledHEEP2CI_lo := (ci_lowerHEEP2 * nHHs_census)/1000000]
occDT[, scaledHEEP2CI_hi := (ci_upperHEEP2 * nHHs_census)/1000000]
occDT[, pc := 100*(dailyGWh/sum(occDT$dailyGWh))]
ft <- set_caption(flextable(occDT[, .(nPeople, nHHs_GG, nHHs_census, dailyGWh,
"% of total" = pc,
scaledggCI_lo, scaledggCI_hi,
nHHs_HEEP2_full = nHEEP2, scaledHEEP2CI_lo,scaledHEEP2CI_hi)]),
caption = "Scaled total daily GWh by household size (2013 Census)")
ft <- colformat_num(ft, big.mark=",")
colformat_num(ft, j = c("nHHs_GG", "nHHs_census", "nHHs_HEEP2_full"), digits = 0)
nPeople | nHHs_GG | nHHs_census | dailyGWh | % of total | scaledggCI_lo | scaledggCI_hi | nHHs_HEEP2_full | scaledHEEP2CI_lo | scaledHEEP2CI_hi |
1 | 3 | 355,284 | 3.00 | 9.88 | -6.32 | 12.32 | 64 | 0.99 | 5.02 |
2 | 10 | 527,676 | 11.76 | 38.73 | 8.18 | 15.34 | 95 | 10.60 | 12.92 |
3 | 7 | 254,862 | 6.00 | 19.75 | 3.04 | 8.95 | 46 | 4.84 | 7.15 |
4+ | 15 | 412,071 | 9.60 | 31.63 | 6.93 | 12.27 | 74 | 8.40 | 10.80 |
Aggregating the total GWh values across all household types after scaling in this way gives a mean daily residential GWh value of 30.4 GWh, compared to the 34.45 GWh MBIE value. This is likely to be caused by the use of the Census 2013 household counts (total households = 1,549,893) which are ~ 10% lower than both the ICP counts in the 2015 MBIE data (1,720,690) and the projected 2015 household counts (1,721,000) used in Section 3.2.1.
Weights created using IPF method and NZ Census 2013 area unit data as part of SPATIALEC project. Aggregate these to give ‘grossing’ weights for each Green Grid household. Then create a fractional weight so they sum to 1. Use weighted survey stats to produce estimates.
ggIPF_DT <- ggIPF_AU_DT[, .(grossWeight = sum(ipfWeight)), keyby = .(linkID)]
# now we need a fractional weight so sum to 1 across linkIDs
ggIPF_DT[, fracWeight := grossWeight/sum(ggIPF_DT$grossWeight)]
# link them
surveyData_DT <- ggIPF_DT[hhDT]
# % sum error
pcError <- 100*(1-sum(ggIPF_DT$grossWeight)/rmdParams$NZ_nHHs_Census_2013)
The grossing weights should sum to a close estimate of the number of Census households:
Difference: 488.971 or 0.03%
With such a small sample we will have a number of dwellings with identical weights. We will also have a few households with large weights.
summary(surveyData_DT$fracWeight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.003024 0.013480 0.013480 0.023810 0.024044 0.154906 4
dt <- ggIPF_DT[, .(n = .N), keyby = .(fracWeight = round(fracWeight,4), grossWeight = round(grossWeight, 4))]
ft <- flextable(dt)
set_caption(ft, caption = "Number of Green Grid households by fractional weight")
fracWeight | grossWeight | n |
0.0030 | 4684.929 | 2 |
0.0071 | 10975.493 | 6 |
0.0097 | 14998.792 | 1 |
0.0135 | 20886.696 | 19 |
0.0240 | 37254.174 | 6 |
0.0353 | 54636.578 | 1 |
0.0381 | 59059.202 | 1 |
0.0562 | 87095.568 | 2 |
0.0584 | 90514.934 | 2 |
0.0838 | 129880.847 | 1 |
0.1549 | 240012.522 | 1 |
sum(dt$fracWeight*dt$n)
## [1] 1.0001
sum(dt$grossWeight*dt$n)
## [1] 1549404
sum(dt$n)
## [1] 42
# let's plot the fractional weights just to illustrate the problem
ggplot2::ggplot(ggIPF_DT, aes(x = fracWeight)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
why do we get se = 0 for 1 person households? check weights :-)
setkey(dailyDT, linkID)
setkey(ggIPF_DT, linkID)
dailyWtDT <- dailyDT[ggIPF_DT] # will drop the ones with no weight
dailyWtDT <- dailyWtDT[!is.na(sumkWh) & !is.na(grossWeight)]
svy_dailyFracWt <- survey::svydesign(id = ~linkID, weights = ~fracWeight,
data = dailyWtDT)
# Fractional weights
sm <- svymean(~sumkWh, svy_dailyFracWt)
sm
## mean SE
## sumkWh 22.062 2.1003
confint(sm)
## 2.5 % 97.5 %
## sumkWh 17.94505 26.17798
smNP <- svyby(~sumkWh, by = ~nPeople_Factor, svy_dailyFracWt, svymean, vartype=c("se","ci"))
smNP
## nPeople_Factor sumkWh se ci_l ci_u
## 1 1 16.90225 1.790235e-15 16.90225 16.90225
## 2 2 20.32065 2.686547e+00 15.05511 25.58618
## 3 3 25.56045 3.502871e+00 18.69495 32.42595
## 4+ 4+ 23.77514 2.540849e+00 18.79517 28.75511
# why do we get se = 0 for 1 person households? check weights :-)
ft <- flextable(smNP)
set_caption(colformat_num(ft, digits = 2),
caption = "Mean daily kWh by size of household")
nPeople_Factor | sumkWh | se | ci_l | ci_u |
1 | 16.90 | 0.00 | 16.90 | 16.90 |
2 | 20.32 | 2.69 | 15.06 | 25.59 |
3 | 25.56 | 3.50 | 18.69 | 32.43 |
4+ | 23.78 | 2.54 | 18.80 | 28.76 |
smNP_DT <- as.data.table(smNP)
smNP_DT[, nPeople := nPeople_Factor]
setkey(smNP_DT, nPeople)
smNP_DT <- smNP_DT[c_occDT]
smNP_DT[, meanDailyGWh := (sumkWh * nHHs_census)/1000000]
smNP_DT[, ci_lower_GWh := (ci_l *nHHs_census)/1000000]
smNP_DT[, ci_upper_GWh := (ci_u *nHHs_census)/1000000]
sum(smNP_DT$meanDailyGWh)
## [1] 33.03925
ft <- flextable(smNP_DT[,.(nPeople_Factor, meanDailyGWh,
ci_lower_GWh, ci_upper_GWh)])
set_caption(colformat_num(ft, digits = 2),
caption = "Mean daily GWh by size of household")
nPeople_Factor | meanDailyGWh | ci_lower_GWh | ci_upper_GWh |
1 | 6.01 | 6.01 | 6.01 |
2 | 10.72 | 7.94 | 13.50 |
3 | 6.51 | 4.76 | 8.26 |
4+ | 9.80 | 7.74 | 11.85 |
Try the gross weight - the mean should be the same.
svy_dailyGrossWt <- survey::svydesign(id = ~linkID, weights = ~grossWeight,
data = dailyWtDT)
sm_gw <- svymean(~sumkWh, svy_dailyGrossWt)
confint(sm_gw)
## 2.5 % 97.5 %
## sumkWh 17.94505 26.17798
message("Can we use svytotal? Only if there are no missing days")
## Can we use svytotal? Only if there are no missing days
nDays <- dailyDT[, .(nDays = .N), keyby = linkID]
summary(nDays$nDays)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 162.0 279.0 231.6 349.2 365.0
message("There are missing days if the mean is not 365")
## There are missing days if the mean is not 365
Identify key attributes of the ‘groups’ with very large weights.
temp_wdt <- ggIPF_DT[hhDT]
temp_wdt[, gWtFactor := as.factor(round(grossWeight, 0))]
t <- with(temp_wdt, table(gWtFactor, nPeople))
kableExtra::kable(t, caption = "N people by grossWeight") %>% kable_styling()
| 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|
| 4685 | 0 | 2 | 0 | 0 | 0 | 0 |
| 10975 | 0 | 0 | 6 | 0 | 0 | 0 |
| 14999 | 0 | 0 | 0 | 1 | 0 | 0 |
| 20887 | 0 | 0 | 0 | 13 | 5 | 1 |
| 37254 | 0 | 6 | 0 | 0 | 0 | 0 |
| 54637 | 0 | 1 | 0 | 0 | 0 | 0 |
| 59059 | 0 | 0 | 1 | 0 | 0 | 0 |
| 87096 | 2 | 0 | 0 | 0 | 0 | 0 |
| 90515 | 2 | 0 | 0 | 0 | 0 | 0 |
| 129881 | 0 | 0 | 1 | 0 | 0 | 0 |
| 240013 | 0 | 1 | 0 | 0 | 0 | 0 |
t <- with(temp_wdt, table(gWtFactor, nAdults))
kableExtra::kable(t, caption = "N adults by grossWeight") %>% kable_styling()
| 1 | 2 | 3 | |
|---|---|---|---|
| 4685 | 2 | 0 | 0 |
| 10975 | 1 | 5 | 0 |
| 14999 | 0 | 1 | 0 |
| 20887 | 0 | 16 | 3 |
| 37254 | 0 | 6 | 0 |
| 54637 | 0 | 1 | 0 |
| 59059 | 0 | 1 | 0 |
| 87096 | 2 | 0 | 0 |
| 90515 | 2 | 0 | 0 |
| 129881 | 0 | 0 | 1 |
| 240013 | 0 | 1 | 0 |
t <- with(temp_wdt, table(gWtFactor, hasPV, useNA = "always"))
kableExtra::kable(t, caption = "N have PV by grossWeight") %>% kable_styling()
| No | yes | NA | |
|---|---|---|---|
| 4685 | 2 | 0 | 0 |
| 10975 | 6 | 0 | 0 |
| 14999 | 1 | 0 | 0 |
| 20887 | 17 | 2 | 0 |
| 37254 | 6 | 0 | 0 |
| 54637 | 1 | 0 | 0 |
| 59059 | 1 | 0 | 0 |
| 87096 | 0 | 2 | 0 |
| 90515 | 2 | 0 | 0 |
| 129881 | 1 | 0 | 0 |
| 240013 | 1 | 0 | 0 |
| NA | 2 | 0 | 2 |
t <- with(temp_wdt, table(gWtFactor, `Heat pump number`, useNA = "always"))
kableExtra::kable(t, caption = "N heat pumps by grossWeight") %>% kable_styling()
| 1 | 3 | NA | |
|---|---|---|---|
| 4685 | 1 | 0 | 1 |
| 10975 | 4 | 0 | 2 |
| 14999 | 0 | 0 | 1 |
| 20887 | 14 | 1 | 4 |
| 37254 | 2 | 0 | 4 |
| 54637 | 0 | 0 | 1 |
| 59059 | 0 | 0 | 1 |
| 87096 | 0 | 1 | 1 |
| 90515 | 0 | 0 | 2 |
| 129881 | 1 | 0 | 0 |
| 240013 | 0 | 0 | 1 |
| NA | 1 | 0 | 3 |
t <- with(temp_wdt, table(gWtFactor, Q7labAgg, useNA = "always"))
kableExtra::kable(t, caption = "Age of dwelling by grossWeight") %>% kable_styling()
| < 1999 | >= 2000 | NA | |
|---|---|---|---|
| 4685 | 2 | 0 | 0 |
| 10975 | 4 | 1 | 1 |
| 14999 | 0 | 1 | 0 |
| 20887 | 14 | 5 | 0 |
| 37254 | 3 | 2 | 1 |
| 54637 | 1 | 0 | 0 |
| 59059 | 1 | 0 | 0 |
| 87096 | 1 | 1 | 0 |
| 90515 | 1 | 1 | 0 |
| 129881 | 0 | 1 | 0 |
| 240013 | 1 | 0 | 0 |
| NA | 1 | 1 | 2 |
Develop a simple daily kWh model
hist(dailyDT$sumkWh)
# skewed (of course)
dailyWinterAggDT <- dailyDT[q == "Q2" | q == "Q3", .(meanDailykWh = mean(sumkWh, na.rm = TRUE)), keyby = linkID]
hist(dailyWinterAggDT$meanDailykWh)
# still skewed (of course)
# test with a qq plot
dailyWinterAggDT[, logkWh := log(meanDailykWh)]
qqnorm(dailyWinterAggDT$logkWh); qqline(dailyWinterAggDT$logkWh, col = 2)
shapiro.test(dailyWinterAggDT$logkWh)
##
## Shapiro-Wilk normality test
##
## data: dailyWinterAggDT$logkWh
## W = 0.93879, p-value = 0.06918
# if p > 0.05 => normal
# is it? Beware: shapiro-wilks is less robust as N ->
setkey(dailyWinterAggDT,linkID)
modelDT <- dailyWinterAggDT[hhDT[, .(linkID, nPeople_Factor, Q7labAgg, Location, Q20_coded)]]
skimr::skim(modelDT)
| Name | modelDT |
| Number of rows | 46 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| linkID | 0 | 1.00 | 5 | 6 | 0 | 46 | 0 |
| nPeople_Factor | 4 | 0.91 | 1 | 2 | 0 | 4 | 0 |
| Q7labAgg | 4 | 0.91 | 6 | 7 | 0 | 2 | 0 |
| Location | 2 | 0.96 | 8 | 10 | 0 | 2 | 0 |
| Q20_coded | 2 | 0.96 | 0 | 31 | 2 | 9 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| meanDailykWh | 14 | 0.7 | 26.30 | 10.94 | 13.98 | 18.26 | 21.04 | 32.63 | 53.37 | ▇▂▃▂▁ |
| logkWh | 14 | 0.7 | 3.19 | 0.39 | 2.64 | 2.90 | 3.05 | 3.49 | 3.98 | ▇▇▅▅▂ |
modelDT[, hasElecHeat := ifelse(Q20_coded == "Heat pump" |
Q20_coded == "HRV or other ventilation system" |
Q20_coded == "Portable electric heaters", "Yes", "No")]
m <- lm(logkWh ~ Location + Q7labAgg +
hasElecHeat + nPeople_Factor, data = modelDT)
summary(m)
##
## Call:
## lm(formula = logkWh ~ Location + Q7labAgg + hasElecHeat + nPeople_Factor,
## data = modelDT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6157 -0.3327 0.0000 0.2972 0.7529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.03093 0.52565 5.766 8.43e-06 ***
## LocationTaranaki -0.07109 0.21252 -0.334 0.741
## Q7labAgg>= 2000 -0.08899 0.20134 -0.442 0.663
## hasElecHeatYes 0.02884 0.21268 0.136 0.893
## nPeople_Factor2 0.19670 0.51403 0.383 0.706
## nPeople_Factor3 0.35621 0.54700 0.651 0.522
## nPeople_Factor4+ 0.19336 0.53829 0.359 0.723
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4394 on 22 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.06954, Adjusted R-squared: -0.1842
## F-statistic: 0.274 on 6 and 22 DF, p-value: 0.9432
set_caption(colformat_num(flextable(broom::tidy(m)), digits = 3),
caption = "Model predicting log(mean daily kWh) 2015")
term | estimate | std.error | statistic | p.value |
(Intercept) | 3.031 | 0.526 | 5.766 | 0.000 |
LocationTaranaki | -0.071 | 0.213 | -0.334 | 0.741 |
Q7labAgg>= 2000 | -0.089 | 0.201 | -0.442 | 0.663 |
hasElecHeatYes | 0.029 | 0.213 | 0.136 | 0.893 |
nPeople_Factor2 | 0.197 | 0.514 | 0.383 | 0.706 |
nPeople_Factor3 | 0.356 | 0.547 | 0.651 | 0.522 |
nPeople_Factor4+ | 0.193 | 0.538 | 0.359 | 0.723 |
Not a great model: -ve adjusted r sq is a strong indicator of this
Test the model - diagnostics
message("# Diagnostic plots")
## # Diagnostic plots
plot(m)
## Warning: not plotting observations with leverage one:
## 6
message("# Normality of residuals")
## # Normality of residuals
hist(m$residuals)
qqnorm(m$residuals); qqline(m$residuals, col = 2)
shapiro.test(m$residuals)
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.94209, p-value = 0.1137
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
message("# autocorrelation/independence of errors")
## # autocorrelation/independence of errors
car::durbinWatsonTest(m)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.09877985 2.105069 0.99
## Alternative hypothesis: rho != 0
# if p < 0.05 then a problem as implies autocorrelation
message("# homoskedasticity: formal test")
## # homoskedasticity: formal test
car::ncvTest(m)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.10365, Df = 1, p = 0.29347
# if p > 0.05 then there is heteroskedasticity
message("# -> collinearity")
## # -> collinearity
car::vif(m)
## GVIF Df GVIF^(1/(2*Df))
## Location 1.678126 1 1.295425
## Q7labAgg 1.303368 1 1.141651
## hasElecHeat 1.454321 1 1.205952
## nPeople_Factor 2.149870 3 1.136062
# if any values > 10 -> problem
message("# -> tolerance")
## # -> tolerance
1/car::vif(m)
## GVIF Df GVIF^(1/(2*Df))
## Location 0.5959028 1.0000000 0.7719474
## Q7labAgg 0.7672431 1.0000000 0.8759241
## hasElecHeat 0.6876061 1.0000000 0.8292202
## nPeople_Factor 0.4651444 0.3333333 0.8802337
# if any values < 0.2 -> possible problem
# if any values < 0.1 -> definitely a problem
#save the residuals via broom
library(broom)
resids <- broom::augment(m)
# plot fitted vs residuals
ggplot(resids, aes(x = .fitted, y = .std.resid)) +
geom_point(size = 1)
## Warning: Removed 1 rows containing missing values (geom_point).
message("Where there any NAs where the model failed?")
## Where there any NAs where the model failed?
summary(resids$.std.resid)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.497679 -0.918058 -0.075575 0.005251 0.799830 1.947064 1
Overall, not a very good model. But it serves the purpose…
To aid reproducibility:
R.version
## _
## platform x86_64-apple-darwin17.0
## arch x86_64
## os darwin17.0
## system x86_64, darwin17.0
## status
## major 4
## minor 0.2
## year 2020
## month 06
## day 22
## svn rev 78730
## language R
## version.string R version 4.0.2 (2020-06-22)
## nickname Taking Off Again
Additional CRAN packages used:
# this will print out the packages we loaded earlier and set up bibtex references
# it requires a bib file to exist (see yaml) and the references to be in this format
for(p in cranPackages){
cat(" * ", p, " [@",p,"]\n", sep = "")
}
Other packages used:
# this will print out the packages we loaded earlier and set up bibtex references
# it requires a bib file to exist (see yaml) and the references to be in this format
for(p in githubPackages){
cat(" * ", p, " [@",p,"]\n", sep = "")
}
Anderson, Ben. 2018. DkUtils: A Bunch of Useful Little Functions. https://github.com/dataknut/dkUtils.
Anderson, Ben, and David Eyers. 2018. GREENGridData: Processing Nz Green Grid Project Data to Create a ’Safe’ Version for Data Archiving and Re-Use. https://github.com/CfSOtago/GREENGridData.
Anderson, Ben, David Eyers, Rebecca Ford, Diana Giraldo Ocampo, Rana Peniamina, Janet Stephenson, Kiti Suomalainen, Lara Wilcocks, and Michael Jack. 2018. “New Zealand GREEN Grid Household Electricity Demand Study 2014-2018,” September. https://doi.org/10.5255/UKDA-SN-853334.
Anderson, Ben, Tom Rushby, Abubakr Bahaj, and Patrick James. 2020. “Ensuring Statistics Have Power: Guidance for Designing, Reporting and Acting on Electricity Demand Reduction and Behaviour Change Programs.” Energy Research & Social Science 59 (January): 101260. https://doi.org/10.1016/j.erss.2019.101260.
Champely, Stephane. 2018. Pwr: Basic Functions for Power Analysis. https://CRAN.R-project.org/package=pwr.
Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.
Gohel, David. 2020. Flextable: Functions for Tabular Reporting. https://CRAN.R-project.org/package=flextable.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.
Lumley, Thomas. 2020. “Survey: Analysis of Complex Survey Samples.”
Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
———. 2018. Hms: Pretty Time of Day. https://CRAN.R-project.org/package=hms.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Wickham, Hadley, and Jennifer Bryan. 2019. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Zhu, Hao. 2018. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.